We begin by considering math word problems of the form in Figure 1, which measure the arithmetic
reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where
language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-
of-thought prompting when used with the 540B parameter language model performs comparably with
task-specific finetuned models on several tasks, even achieving new state of the art on the challenging
GSM8K benchmark (Cobbe et al., 2021).
